The Validity of Student Ratings:
A Critique 

Steven V. Owen 
University of Connecticut 

Introduction

In their latest edition of Learning and Human Abilities, Klausmeier
and Goodwin (1975, p. 174) insist that, beyond product, process, and
presage criteria, "There are no other generally accepted criteria, or 
procedures, for evaluating the effectiveness of classroom teachers." 
Since student evaluations are not classified as product, process, or 
presage criteria, Klausmeier and Goodwin imply, by omission, that student
ratings are not credible as sources of information about teacher effec- 
tiveness. This paper will attempt to establish that Klausmeier and 
Goodwin are correct. To be considered are such traditional features as 
validity and reliability, as well as worth, and political and ethical 
considerations. 

As Frey (1974) has noted, student ratings are currently enjoying a 
surge of popularity. The literature abounds with studies on student 
ratings, and often on the development of new rating scales. Yet we learn 
little from all of this literature, because new scales, and their often 
improper administration, rarely resolve the problems of old scales. For 
some observers, the increased use of student ratings as "measures" of 
teacher effectiveness has implied an increased acceptance of this form 
of assessment. A closer look, however, suggests that there is abundant 
confusion and, occasionally, skepticism about the meaningfulness, validity,
and usefulness of student ratings. Some researchers have pointed out that 
the utility of student ratings depends on the purpose of the ratings. 
Doyle and Whitely (1974), for example, proposed that the interpretation of 
ratings depends upon whether they are intended for personnel decisions or 
for diagnostic purposes. Yet extremely little is known of the circumstances 
or conditions which permit useful decisions about student ratings (cf.
McKeachie, 1973).

Some administrators, instructors, and researchers strongly support
the use of student ratings for all purposes. Proponents have claimed,
glibly, that those who dislike student ratings are simply reacting on the 
basis of a generalized fear of being evaluated. However, a rigorous
appraisal of research and theory shows enough inconsistency, methodological
shortcomings, and naive acceptance of student ratings to cause genuine
trepidation about their use as evaluative instruments. The purpose of
this paper is to outline several of the most common problems in the re- 
lated literature; these problems, I think, have contributed most to our 
lack of understanding about student ratings of teacher effectiveness. 

Grades Awarded.

A plethora of studies have addressed the question of how ratings are
related to grades awarded, but the literature is replete with contradiction.
A basic problem is that the relationship between grades and ratings may
depend on which comes first. (It should be noted that few studies report
whether grades or evaluations come first.) It seems reasonable that grades
may have a greater influence on ratings if the ratings are done after
grades are awarded. If there are situational events that influence a
student's feelings about a course or instructor, an unexpected
grade may well change those feelings. Holmes (1972) found a powerful low- 
ering of ratings after students were given a grade which was lower than 
they expected to get. Holmes' conclusion is that we should keep students 
"adequately informed of their proficiency, [so that] the possibility of 
disconfirmed expectancies will be decreased...." (p. 133). This suggestion
refers to surprise grades at the end of the semester; but some students 
get surprises on exams throughout a course, which may well affect their 
ratings of the instructor. Bausell and Magoon (1972) supported this com- 
ment, showing that students whose grade expectancies decreased during a 
course also lowered their ratings of the course and instructor. Kennedy's
(1975) study confirms the relationship between expected grades and student
ratings, although the study was limited to a single course (across 15
sections).

Using a multivariate design, Lolli and Owen (1976) and Bausell and
Magoon (1972) found significant differences in ratings among three
groups of students: those whose expected course grade was lower than
their GPA, those with congruent expected grade and GPA, and those whose
expected grade was higher than their GPA. As hypothesized, the discrepancy
between expected grade and GPA is a potent intervening variable in
ratings. A superficial solution is to omit from rating summaries all
the "discrepant" students, and examine only the ratings of "non-discrepant"
students. We have yet to see evidence, however, that this middle group of
students provides more valid ratings than are otherwise obtained.

The potential for student "whimsy" is compounded when, as is sometimes
the case, ratings are collected after the course via mailed
questionnaires. A return rate of 50 percent (Brown, 1974) does not guarantee
a representative sample of all students in a course. It may be that 
those motivated to return the rating questionnaires are those with the 
strongest feelings about the course. While this outcome may ensure 
ratings at both ends of the like-dislike continuum, it may also under- 
represent those with moderate likes and dislikes. The literature does 
not provide answers about the representativeness of voluntary question- 
naire returns of student ratings. 
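
The concern about voluntary returns can be illustrated with a small, purely hypothetical calculation. The rating counts and return rates below are invented for illustration (they are not taken from Brown, 1974, or from any other study cited here); the sketch simply assumes that students with stronger feelings are more likely to mail back the questionnaire.

```python
# Hypothetical class of 100 students: how many hold each rating on a
# 1-5 scale. These counts are assumptions made up for this sketch.
population = {1: 10, 2: 20, 3: 40, 4: 20, 5: 10}

# Assumed chance of returning the mailed questionnaire, set higher for
# students with stronger feelings (also invented numbers).
return_rate = {1: 0.9, 2: 0.5, 3: 0.3, 4: 0.5, 5: 0.9}

# Expected number of returned questionnaires at each rating level.
returned = {r: round(n * return_rate[r]) for r, n in population.items()}


def mean_rating(counts):
    """Mean rating for a {rating: count} table."""
    n = sum(counts.values())
    return sum(r * c for r, c in counts.items()) / n


def rating_variance(counts):
    """Population variance of ratings for a {rating: count} table."""
    m = mean_rating(counts)
    n = sum(counts.values())
    return sum(c * (r - m) ** 2 for r, c in counts.items()) / n


def share(counts, rating):
    """Proportion of the table that gave the specified rating."""
    return counts[rating] / sum(counts.values())


overall_return = sum(returned.values()) / sum(population.values())

print(f"return rate:      {overall_return:.2f}")
print(f"mean, all:        {mean_rating(population):.2f}")
print(f"mean, returned:   {mean_rating(returned):.2f}")
print(f"variance, all:    {rating_variance(population):.2f}")
print(f"variance, ret.:   {rating_variance(returned):.2f}")
print(f"share of 3s, all: {share(population, 3):.2f}")
print(f"share of 3s, ret: {share(returned, 3):.2f}")
```

Under these assumed numbers the overall return rate is 50 percent and the mean rating is unchanged, yet the variance of the returned ratings is larger and the share of middle ratings falls from 40 to 24 percent. A different set of assumptions could just as easily shift the mean; the point is only that a 50 percent return does not settle the question of representativeness.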

The interpretation of a correlation between ratings and grades 
awarded is difficult. If students receiving higher grades tend to rate 
instructors more highly, it is possible that the instructor is, in fact, 
more effective for them. Another viewpoint is that the effectiveness of 
an instructor is a "truth," and student variation in ratings represents 
error variance in interpreting the "truth" (Deshpande et al., 1970).
Nevertheless, the biasing influence of grades has not been resolved in 
the literature. 

Subject Matter Content or Major Area of Study.

There are only a few studies on the relationship between ratings and
course content or major area, but there appears to be consistency in the
findings. In particular, the nature of the content area (Veldman and Peck,
1969) or the student's major area (Slater, 1974; Paulus, 1973; Remmers,
1963; Centra, 1973a, b; Kennedy, 1972) influences student impressions of
teachers.
Veldman and Peck (1969) appear to dismiss this influence, commenting that 
the nature of course content explains only a small portion of ratings 
variance, and that prior self-selection of teachers into content areas
"undoubtedly" relates to differential ratings. Nevertheless, such minor
variables continue to erode the trustworthiness of ratings. 

Agreement Between Student Raters and Other Raters.

Several studies have examined the level of evaluative agreement be- 
tween students and other sources of ratings. Tolor (1973), for example, 
compared high school students' choice of "most" and "least" effective 
teachers with those of administrators, other faculty, and parents. He
found moderate agreement among groups about the effective teachers, but 
students labeled as "ineffective" a quite different set of teachers than 
did the other groups. Centra (1973b) compared teachers' self-evaluations
with student ratings and found little agreement (median correlation = .21).
In addition, the college instructors in his sample rated themselves better 
than students did. The widely-read review of Costin et al. (1971) cites
other studies showing low to moderate relationships between student ratings
and others' ratings. Interestingly, Costin et al. view this finding as
"support [for] the contention that student ratings have a contribution of 
their own to make in the evaluation of teaching" (p. 517). A more cynical 
perspective is that students are not only different in their ratings, but 
also no better! 

Student Learning.

Many have insisted that student achievement is the most decisive ex- 
ternal criterion for ratings. But there is no agreement in the literature 
about the relationship between these two factors. While many studies have 
reported low, positive relationships, others (see Turner and Thompson, 1974) 
have found negative or no relationships. Costin et al. (1971) and Kulik
and McKeachie (1975) provide a representative sample of these studies.
Recently, some researchers have decided to publicly debate the issue
(Rodin and Rodin, 1972; Rodin, 1973; Frey, 1973, 1974). Although both
factions would disagree, the myriad differences in their research
methodologies permit a decision of "no decision."

One of the great difficulties in the use of student achievement as 
a criterion is its multiple meanings. Rosenshine's otherwise excellent work,
Teaching Behaviors and Student Achievement (1971), has been roundly
criticized (Gall, 1973) because of vagueness in defining "achievement." The
Rodins-Frey debate is striking because of its lack of attention to validity 
and reliability of their achievement measures. Any research using student 
achievement should explicate clearly the meaning, context, validity, and 
reliability of such measures. 

Halo Effect.

Some researchers have discounted a generalized halo effect¹ running
through a set of teacher rating items, usually on the basis of factor
analyses which reveal several separate dimensions underlying a complete
scale. However, factorial validity and stability do not necessarily
preclude a halo effect; in fact, another interpretation is that the halo
effect is merely multidimensional! Other theory and research (Widlak
et al., 1973) gives evidence for the notion of the halo effect in ratings.
Cronbach (1958) suggested that raters often carry internal stereotypes (he
called it "implicit personality theory") about clusters of attributes
that people are "supposed" to have. Implicit personality theory can thus
be an explanatory mechanism for the halo effect. Support for implicit
personality theory's influence on ratings comes from a variety of sources.
Person-perception research, for instance, would predict that students
would rate lower those instructors whose attitudes were perceived to be
different from those of the students. Good and Good (1973) and Levensen
and LeUnes (1974) found this to be the case.

¹ Other researchers have developed euphemisms for the halo effect in
student ratings. Aleamoni (1973), for instance, called it the
"general course attitude."

Passini and Norman (1966) found that highly similar factor structures
emerged for two groups of raters: one group knew the people they
were rating; the other group did not. Their implication that a priori
impressions reduce rater objectivity formed the basis for later research
by Magoon and Price (1972). Magoon and Price found congruence in rating
factors between students who rated their instructors before the course
began and students who rated instructors after the course. Oddly, they
discounted the halo effect as an explanation for this finding. Rather,
they said, the "item relationships [seemed to be based on raters'] previous
experience with other instructors" (p. 9). Nevertheless, they conclude
(p. 9) that ratings may suggest more about "preconceptions of students
than about real differences between courses and instructors." Another
way of stating this conclusion is that interrater reliability may tell
us more about the consistency of classification schemes than it does
about the actual effectiveness of instructors.

Whitely and Doyle (1975) have supported this assertion by examining
the congruence of factor analytic dimensions across raters, courses, and
instructors. Most striking in their study was the correspondence between
underlying dimensions of actual ratings and clusters of rating items
which students were asked to group into homogeneous sets. Ghiselli and
Ghiselli (1972) have perhaps most clearly summed up the influence of
implicit personality theory:

[T]he report the rater makes about a stimulus person is not a 
faithful reflection of the qualities that person possesses or 
manifests, but rather is a report of his impression of that 
person, a description of his mental reaction to him. This 
reaction, of course, is conditioned by his social and cultural 
background (p. 2J0). 

Course or Class Level and Course Size.

Some research suggests that systematic differences in instructor 
ratings occur as a function of the class level of students. Tolor (1973) 
found that high school students' judgments about teachers were related to 
the students' class level (e.g., sophomore, junior, etc.). Aleamoni and
Graham (1974) discovered similar outcomes at the college level. Class
size has been shown to influence ratings in a negative fashion: the
larger the course, the lower the ratings (McKeachie, 1975; Paulus, 1973;
Klafehn, 1975; Scott, 1975).

Reliability.

Almost all recent studies on student ratings appear to dismiss
reliability quickly and cavalierly, by one of two methods. First, instead
of calculating a reliability estimate for the measure used, one can refer
to, say, Costin, Greenough, and Menges' (1971) review to find ample support
for the reliability of such measures. Second, researchers can calculate
their own estimates for a measure at hand. In either case, reliability
estimates tend to be flawed, because the same erroneous techniques
continue to be used.

Medley and Mitzel (1963, p. 253) correctly pointed out that "the
term reliability coefficient refer[s] to the correlation to be expected
between scores based on observations made by different observers at
different times" (italics added). Rarely do we see this type of
coefficient used.² Rather, we hear of "stability" estimates (e.g., student
rankings now vs. student rankings a year later) and internal consistency
estimates. Costin et al. (1971) reported rather high stability estimates,
.48 to .89. The magnitudes are about that high in Bausell and Magoon's
(1972) correlations between first day and last day ratings. However,
Bausell and Magoon acknowledged that the high stability may mean either
accuracy of ratings or durability of student bias.

Medley and Mitzel (1963) cautioned that a halo effect will add common
variance to "different" rating items, so that a scale must necessarily
(and spuriously) build internal consistency. Also, they remarked, to the
extent a halo effect is persistent over time, stability estimates will be
inflated. The halo is likely to be detected by high correlations among
items attempting to measure conceptually different teacher behaviors. As
such correlations are fairly easy to find, one wonders about the level of
exaggeration in "reliability" estimates of student ratings.
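
The way a halo can manufacture internal consistency may be easier to see in a toy calculation. The sketch below is an illustration under stated assumptions, not a reanalysis of any study cited here: the eight raters and their scores are hypothetical, the three "items" are constructed to be mutually uncorrelated, and Cronbach's coefficient alpha is used as one familiar internal consistency estimate.

```python
from statistics import pvariance


def cronbach_alpha(items):
    """Coefficient alpha for a list of k items, each a list of scores
    with one score per rater (same rater order in every item)."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]
    item_variance_sum = sum(pvariance(item) for item in items)
    return k / (k - 1) * (1 - item_variance_sum / pvariance(totals))


# Hypothetical deviations (+1 / -1) of eight raters on three conceptually
# different teacher behaviors, built so the three are uncorrelated.
trait_a = [1, 1, 1, 1, -1, -1, -1, -1]
trait_b = [1, 1, -1, -1, 1, 1, -1, -1]
trait_c = [1, -1, 1, -1, 1, -1, 1, -1]

# Hypothetical general impression of the teacher held by each rater,
# uncorrelated with every one of the three specific behaviors.
halo = [1, -1, -1, 1, -1, 1, 1, -1]

# Ratings on a 5-point scale centered at 3, without and with the halo
# added to every item.
no_halo = [[3 + x for x in trait] for trait in (trait_a, trait_b, trait_c)]
with_halo = [
    [3 + x + h for x, h in zip(trait, halo)]
    for trait in (trait_a, trait_b, trait_c)
]

print(f"alpha without halo: {cronbach_alpha(no_halo):.2f}")
print(f"alpha with halo:    {cronbach_alpha(with_halo):.2f}")
```

With these invented scores, alpha is 0.00 for the three unrelated items and 0.75 once the same general impression is added to each of them. Nothing about the three behaviors has become more consistently measured; the shared halo alone produces the apparent internal consistency.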

² Kulik and McKeachie (1975, pp. 222-223) give a few examples of
such estimates; the range shown is .34 to .67.

Another disconcerting influence of rating scale reliabilities occurs
when rating scales are examined for their relationship to such variables
as student and teacher characteristics. Often, in studies of how student
and teacher personalities affect ratings, several measurements are
regressed against some rating criterion. Grush and Costin (1975) and
Treffinger and Feldhusen (1970) provide examples of this type of analysis.
Treffinger and Feldhusen found modest multiple correlations (about .40)
between a battery of student characteristics and end of course ratings.
They concluded that student variables "only" account for 21 percent of
the criterion variance. Since the reliability of the criterion sets an
upper boundary on its predictability (Cureton, 1965, p. 344),³ Treffinger
and Feldhusen's comment about 21 percent of the criterion variance is
meaningless until we know how much criterion variance is reliable and thus
predictable. Should Treffinger and Feldhusen's criterion have a reliability
estimate of only .50, then they have actually accounted for 42 percent of
the predictable criterion variance. Obviously, reliability estimates can
change our ideas about the magnitude of relationships between ratings and 
other variables. 
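
The arithmetic behind the 42 percent figure can be written out as a short sketch. The function name is hypothetical, and the reliability values other than .50 are chosen only to show how the result moves; the calculation itself is simply the proportion of criterion variance accounted for divided by the reliability of the criterion.

```python
def share_of_predictable_variance(variance_accounted_for, criterion_reliability):
    """Proportion of the reliable (predictable) criterion variance that
    the predictors account for.

    variance_accounted_for: proportion of total criterion variance
        explained (for example, 0.21).
    criterion_reliability: reliability estimate of the criterion,
        greater than 0 and at most 1.
    """
    if not 0.0 <= variance_accounted_for <= 1.0:
        raise ValueError("variance_accounted_for must be between 0 and 1")
    if not 0.0 < criterion_reliability <= 1.0:
        raise ValueError("criterion_reliability must be in (0, 1]")
    return variance_accounted_for / criterion_reliability


# 0.50 is the reliability supposed in the text; 1.00 and 0.80 are
# illustrative values added here for comparison.
for reliability in (1.00, 0.80, 0.50):
    share = share_of_predictable_variance(0.21, reliability)
    print(f"reliability {reliability:.2f}: {share:.0%} of predictable variance")
```

With a perfectly reliable criterion the figure stays at 21 percent; with a reliability of .50 it doubles to 42 percent, which is the case described above.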

The Politics of Evaluation.

There is no question that student evaluations carry political
overtones. Teacher organizations and unions are perhaps the most vocal
opponents of student ratings.⁴ If "excellence" in teaching is decided by
students, and maintained by a reward system, it is feared that excellence
may deteriorate to subservience: those who control merit rewards are
in a position to "call the shots" about how teachers should behave
(Bolton, 1972). Also, the emotional dimension of evaluative ratings
often produces tension, hostility, and strain in interpersonal
relationships (Gruenfeld and Weissenberg, 1966; Kerlinger, 1971).

³ "Variance accounted for" is the following proportion:

    \[ \frac{R^{2}}{\text{reliability of criterion}} \]

where R² is the squared multiple correlation, that is, the proportion
of total criterion variance accounted for.

⁴ They also have opposed most other types of teacher evaluation; see
Selden's (1969) American Federation of Teachers position paper on
evaluation.

The American Federation of Teachers (Selden, 1969) has claimed that,
beyond an initial probationary period, evaluation of teachers is not a
legitimate means of improving education. Bolton (1972) finds this attitude
somewhat akin to disregarding the performance of a baseball player
after he has played for a couple of years. Perhaps teachers are not
ideologically ready to accept a critical evaluation of their classroom
performance. We have been free from rigorous evaluation for a long time.
Postman and Weingartner's proposal (1969, p. 139) that students should
"classify teachers according to their ability" was once laughable; today
the laughter has a nervous ring to it. Even if teachers are assured that
student ratings are "merely" measures of satisfaction, their fears are
not allayed. Teachers are afraid of students doing the evaluating, and,
as I have tried to establish in this paper, their fears are not entirely
groundless. A cursory review of the literature is enough to make most
of us stand in awe at the confusion surrounding student ratings.

An Immodest Proposal.

If we are to make more sense of student rating instruments, and the 
scores derived from them, I believe that we should begin a threefold ap- 
proach. The three steps would seem to logically follow the sequence 
presented below. 

First, I would propose a moratorium on student ratings as evaluative
measures.⁵ Admittedly, they may be one of the best available sources of
information about teacher competence (compared to such devices as
administrator scuttlebutt, peer judgments, and self-rating). But the lack of
clear meaning or validity in student ratings invites misuse and continued
disagreement about their worth. Discontinuing student ratings seems to
run counter to the ever mounting press for accountability; but it at
least provides an opportunity to clear some of the evaluative smog that
has been blurring our vision and stinging our sensibilities.

⁵ Given the evidence about teacher preparedness in assessing student
performance, maybe there should be a moratorium on all types of
school evaluation... see Roeder (1973).

Second, we need to relearn some old lessons on important properties 
of rating scales. This remark implies that existing instruments need to 
be refined until they satisfy several minimal criteria outlined by Remmers
(1962, p. 330):

1. Objectivity. Use of the instrument should yield verifiable,
reproducible data not a function of the peculiar characteristics
of the rater.

2. Reliability. It should yield the same values, within the
limits of allowable error, under the same set of conditions....
This criterion boils down to the accuracy of observations by
the rater[s]....

3. Sensitivity. It should yield as fine distinctions as are
typically made in communicating about the object of
investigation.

4. Validity. Its content, in this case the categories in the
rating scale, should be relevant to a defined area of
investigation and to some relevant behavioral science construct; if
possible, the data should be covariant with some other,
experimentally independent, index....

5. Utility. It should efficiently yield information relevant to
contemporary theoretical and practical issues.

Finally, student ratings should be studied hard and long. This supports
the idea that their use, for the time being, must be experimental
and not evaluative. There is enough exploratory research to give us some
good ideas about programmatic research. Here, then, are a few suggestions
for research directions:



1. What is the influence of rater anonymity? We keep pretending 
that students will change their ratings if they have to reveal 
their identity. The caution of anonymity in ratings should be
built on direct evidence, not intuition.⁶

2. What are the differential effects on ratings when they are 
done before vs. after grades are awarded? Do these effects 
interact with student, teacher, or subject matter characteris- 
tics? 

3. What is the criterion-related validity of ratings, using a 
variety of criteria, such as residualized student achievement, 
student affect toward course content, teacher behavioral change,
student social behavior, and direct observations of teacher
behavior by trained observers? Bolton (1972), Turner (1973),
and Smith (1974) have proposed the use of "jury" models for
weighting the various sources of evaluative information.
Similarly, the conglomerate of criteria proposed here can be
used as a multivariate view of the outcomes of teacher behavior.
A variety of multivariate techniques are available to handle
source, method (Halstead, 1970), and outcome variables; multiple
regression, discriminant analysis, factor analysis, canonical
analysis, and multiple analysis of variance and covariance
methods have been used too rarely.

4. How are different rating formats related to other external
criteria? There is some evidence that format changes produce
rating changes (Follman et al., 1974), as well as evidence
that they do not (Froman, 1976). Are there certain circumstances
(i.e., types of students or types of rating items)
which interact with format to produce higher or lower ratings?

5. Under what circumstances does provision of evaluative feedback
help teachers improve? Is there a difference, for example,
between the informational "worth" of high inference or low
inference feedback? Many studies have addressed the issue of
feedback; Trent and Cohen (1975, p. 1046) provide a good review.
But it is not yet clear under what circumstances feedback
improves instruction. Again, multivariate techniques allow the
simultaneous consideration of many possible interactions between
teacher, student, and school characteristics; content or subject
area; type of feedback; and regularity of feedback.

6. What is the degree of interaction between teacher evaluation
procedures and student evaluation procedures? For instance,
does a teacher's use of a norm-referenced vs. a criterion-referenced
grading system influence student ratings?

⁶ One unpublished research report (Anon., n.d.) supports the common
view that identified student ratings will be higher than anonymous
responses.

7. Can students be trained to be (more) objective raters? That
is, can they be trained to make judgments about teacher behavior
which are apart from, but not necessarily inconsistent
with, "consumer satisfaction"? What types of students are
most objective in rating what types of teachers? Is it
possible for students to employ a common, consistent frame of
reference? (That they currently do not is suggested by the
research of Sanders and Lynch (1973).)

8. How can we build rating instruments that distinguish the 
middle ranges of teaching ability as well as the very good 
and the very poor? 

9. Is it possible to build a rating system which balances the
teacher attributes that students value with teacher characteristics
that produce good learning?

10. Given the evidence that teachers show only moderate stability 
in producing student learning (Brophy, 1973), can we expect 
student ratings to show only moderate stability? (The stability
issue implies an important but ordinarily forgotten rule of
good research: replication.) What are the implications for
ratings if broad intrateacher fluctuations are found? What are
the implications if factorial invariance (or stability) of 
rating scales is not found (Villano and Rosenstock, 1973)? 

11. Are the future answers to any of the previous questions moderated
by the purpose of teacher evaluation, or can the findings be
generalized across purposes? 
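
Question 3 above mentions residualized student achievement as one possible criterion. As a rough sketch of what that could mean in practice, the code below assumes one common reading of the term: the part of a posttest score that is not predicted from a pretest score by simple linear regression. The function name and the four pairs of scores are hypothetical and serve only to show the computation.

```python
def residualized_achievement(pretest, posttest):
    """Residuals from the least-squares regression of posttest on pretest.

    Returns one residual per student: positive values mean the student
    scored higher on the posttest than the pretest alone would predict.
    """
    if len(pretest) != len(posttest) or len(pretest) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    n = len(pretest)
    mean_pre = sum(pretest) / n
    mean_post = sum(posttest) / n
    sum_sq_pre = sum((x - mean_pre) ** 2 for x in pretest)
    if sum_sq_pre == 0:
        raise ValueError("pretest scores must not all be identical")
    sum_cross = sum(
        (x - mean_pre) * (y - mean_post) for x, y in zip(pretest, posttest)
    )
    slope = sum_cross / sum_sq_pre
    intercept = mean_post - slope * mean_pre
    return [y - (intercept + slope * x) for x, y in zip(pretest, posttest)]


# Hypothetical scores for four students (invented for illustration).
pretest_scores = [40, 50, 60, 70]
posttest_scores = [55, 58, 72, 75]

residuals = residualized_achievement(pretest_scores, posttest_scores)
print([round(r, 2) for r in residuals])
```

By construction the residuals sum to zero and are uncorrelated with the pretest, so a class mean of such residuals reflects achievement beyond what entering ability would predict. Whether that quantity is itself a valid and reliable criterion is exactly the kind of question the paper asks researchers to state explicitly.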